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Abstract 



In many applications we have both observational and (randomized) interventional data. 
We propose a Gaussian likelihood framework for joint modeling of such different data-types, 
based on global parameters consisting of a directed acyclic graph (DAG) and correponding 
edge weights and error variances. Thanks to the global nature of the parameters, maximum 
likelihood estimation is reasonable with only one or few data points per intervention. We 
prove consistency of the BIG criterion for estimating the interventional Markov equivalence 
class of DAGs which is smaller than the observational analogue due to increased partial 
identifiability from interventional data. Such an improvement in identifiability has imme- 
diate implications for tighter bounds for inferring causal effects. Besides methodology and 
theoretical derivations, we present empirical results from real and simulated data. 

Keywords: Causal inference; Interventions; BIC; Graphical model; Maximum likelihood 
estimation; Greedy equivalence search 

1 Introduction 

Causal inference often relies on an underlying influence diagram in terms of a directed acyclic 
graph (DAG). In absence of knowledge of the true underlying DAG, there has been a substan- 
tial line of research to estimate the Markov equivalence class of DAGs which is identifiable from 
data. Most often, the target of interest is the observational Markov equivalence class to be 
inferred from observational data; that is, the data arises from observing a system in "steady 



state" without any interventions, see for example the books by Spirtes et al. (2000) or Pearl 



(2000). For the important case of multivariate Gaussian distributions, the observational Markov 



equivalence class is rather large and thus, many parts of the true underlying DAG are uniden- 



tifiable from observational data, see for example Verma and Pearl (1990) or Andersson et al. 



(1997) for a graphical characterization of the Markov equivalence class in the Gaussian or the 
fully nonparametric case. Under additional assumptions, identifiability of the whole DAG is 



guaranteed as with linear structural equation models with non-Gaussian errors (Shimizu et al. 



2006 


) or additive noise models ( 


Hoyer et al. 


2009 


), see also 


Peters et al. ( 


2011 



In many applications, we have both observational and interventional data, where the latter 
are coming from (randomized) intervention experiments. In biology, for example, we often 
have observational data from a wildtype individual and interventional data from mutants or 
individuals with knocked-out genes. Besides the methodological issue of properly modeling such 
data, we gain in terms of identifiability: the interventional Markov equivalence class is smaller 



(Hauser and Biihlmann 2012), thanks to additional interventional experiments, and this is of 
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particular interest for the Gaussian and nonparametric cases which are hardest in terms of 
identifiabihty. 

We focus here on the problem of joint modeling of observational and interventional Gaussian 
data. Thereby, we assume that the observational distribution is Markovian (and typically faith- 



ful; cf. Spirtes et al. 2000 ) to a true underlying DAG Dq and that the different interventional 
distributions are linked to the DAG Dq and the observational distribution via the intervention 
calculus using the do-operator (Pearl, 2000). Linking all interventional distributions to the 



same DAG Dq and the single observational distribution allows to deal with the situation where 
we have only one interventional data point for every intervention target (intervention experi- 
ment). We propose to use the maximum likelihood estimator which has not been studied or 
even used for the observational-interventional data setting. We prove that when penalizing with 
the BIG score, it consistently identifies the true underlying observational-interventional Markov 
equivalence class. 



1.0.1 Relation to other work 

Some approaches to incorporate interventional data for learning causal models have been de- 
veloped in earlier work. Cooper and Yoo (1999) and Eaton and Murphy (2007) address the 



problem of calculating a posterior (and also a likelihood) of a data set having observational as 
well as interventional data but do not investigate properties of the Bayesian estimators e.g. in 
the large-sample limit nor address the issue of identifiability or Markov equivalence. He and 



[Geng| (|2008j) present a method which first estimates the observational Markov equivalence class 
and then in a second step, it identifies additional structure using interventional data. This tech- 
nique is inefficient due to decoupling into two stages, especially if one has many interventional 
but only a few observational data: in fact, our maximum likelihood estimator in Section [3] can 
cope with the situation where we have interventional data only. To our knowledge, no analysis 
of the maximum likelihood estimator of an ensemble of observational and interventional data 
has been pursued so far. The computation of the maximum likelihood estimator which we will 



briefly indicate in Section 4.2 has been developed in Hauser and Biihlmann (2012): due to its 



non-trivial nature, it is not dealt with in this paper. When having observational data only, the 
works by Ghickering ( 2002a|b ) are dealing with maximum likelihood estimation and consistency 
of the BIG score for the corresponding observational Markov equivalence class: however, the 
extension to the mixed interventional-observational case, which occurs in many real problems, 
is a highly non-trivial step. 



2 Interventional-observational data and maximum likelihood es- 
timation 

We start by presenting the model and the corresponding maximum likelihood estimator. 



2.1 A Gaussian model 

We consider the setting with riobs observational and n[^t interventional p-variate data from the 
following model: 

^int ' • • • ' Mnt'^ independent, and independent of , . . . , , X^^J ~ Pj^ • (1) 

In the following, we specify the observational distribution Pobs and all the interventional distri- 
butions Pj^l (i = 1, . . . , nint). 



Regarding the observational distribution, we assume that 

-Pobs = A/'p(0, S), where Pobs is Markovian with respect to a DAG D. 



(2) 
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The assumption with mean zero is not reaUy a restriction: ah derivations can be easily adapted, 
at the price of writing an intercept in many formulas. An implementation in the R-package 



pcalg (Kalisch et al. , 2012) offers the option to restrict to mean zero or not. The Markovian 



assumption is equivalent to the factorization property in ([s]) below. We sometimes refer to the 
true observational distribution as Po.obs with parameter Sq, and the true DAG is Dq. 

In the following, the set of nodes in a DAG D, associated to the p-dimensional random vector 
{Xi, . . . , Xp), is denoted by {1, ■ ■ ■ ,p} and the parental set by 

pa(j) = pa£,(j) = {A;; k a parent of node j} (j = 1, . . . ,p) . 

The Markov condition of Pobs with respect to the DAG D, with parental sets pa(-) = pa£)(-). 



allows the following (minimal) factorization of the joint distribution (Lauritzen, 1996): 



/obs(3^) — 1^ fohsi^j\Xpa(j)) J 



(3) 



are univariate Gaussian 



where /obs(') denotes the Gaussian density of Fobs and fohs{xj 
conditional densities. 

The interventional distributions P-^^ [i = l,...,nint) may all be different but linked to 
the same observational distribution Pobs and the same DAG D via the intervention calculus 
Due to the common underlying model given by Pobs and the DAG D, this 



in Section 2.1.1 



allows to handle cases where we have only one interventional data point for every interventional 
distribution. 



2.1.1 Intervention calculus 



The intervention calculus, or do-calculus (Pearl, 2000), is a key concept for describing the model 



of the intervention distributions. We consider the DAG D appearing in the observational model 
(U, and we assign it a causal interpretation as follows. Assume Xjnt is realized under a (single- 
or multi- variable) intervention at the intervention target / C {1, . . . ,p} denoting the set of inter- 
vened vertices. The distribution of X^^t is then given by the so-called truncated factorization, a 
version of the factorization in Q . The truncated factorization for the interventional distribution 
for Xint with deterministic intervention do{Xj = uj) is defined as ( [Pearl 2000): 



/int(a:/<=| do{Xi = ui)) = /obs(a;j|2;pa(j)n/'=: ^pa(i)n/)j 

i0 

where /int('|do(X/ = uj) is the intervention Gaussian density when doing an intervention at 
Xi by setting it to the value u/, and /obs('l') is as in ([s]). Here, the conditioning argument 
^pa(j)n/'=) '"pa(j)n7' distinguishes the value of the unintervened variables Xpa(j)n/': and the values 
of the intervened variables tipa(j)n/- 

Deterministic interventions as described above make the intervened variables Xj deter- 
ministic, having the value of the intervention levels u/. In this paper, we consider stochas- 
tic interventions where the intervened variables Xi are set to the value of a random vector 
Ui ~ Wj^i fUj{uj)(iuj with independent (but not necessarily identically distributed) compo- 
nents having densities fuji') {j £ I)- The truncated factorization for stochastic interventions 
(where the intervention values are independent of the observational variables) then reads as 
follows: 

/int(x| do{Xj = Ui)) = J|/obs(a;j|xpa(j)n7<:, Ups,(j)ni)Y{fUj{xj). (4) 

HI je/ 

In contrast to the case of deterministic interventions above, the intervention density ([4]) is 
p-variate: x £ MP and for j £ I, xj is then an argument in the density from the random 
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intervention variable Uj. In the following, we assume that the densities for the intervention 
values are Gaussian as well: 



Ui, . . . ,Up independent with Uj ~ M{^Uj , 'Tj) {j = ■ ■ ■ ,p)- (5) 

The truncated factorization in Q or its deterministic version above can be obtained by 
applying the Markov property to the interventional DAG Dj: given a DAG D, the intervention 
DAG Dj is defined as D but deleting all directed edges which point into i £ I, for all i £ I. 

An interventional data point X^^^_, with intervention target T^*) = / C {1, . . . ,p} and cor- 
responding intervention value Uj^\ then has density fint{x\ do{Xj^^ = C/}*^)) from mI. Thus, in 

(i) f 

other words, the intervention distribution Pj^^^ is characterized by the Gaussian density in (4). 



This, together with the specific form of the Gaussian observational distribution (see also (|3| 
fully specifies the model in ([T]) which then reads as: 



^int ' ■ • • ' ^iT'^ independent, and independent of X^]^^, X^^f'\ 



^SL • • • . ^obs'^^ i-i-d- ~ /obs(x)dx as in (§, 

'int ' ■ • • ' ^iS"*^ independent, 
Xi2 ~ /int(x| do(Xy(.) = U^l))dx as in (§ 

. . . , independent, and independent of X^^^, X^^f =\ 

C/»~AA(/.»,diag(rp,...,r»2)) (6) 

The true underlying parameters and quantities in the model (6j) are denoted by fiQ, Sq, /^oly, 

{TQ*j^}j and the true DAG Dq. It is well known, see also Section [sj that Dq is typically not 
identifiable from the observational and a few interventional distributions. 

2.1.2 Structural equation model 

The model in ([g]) (or in ([T])) can be alternatively written as a linear structural equation model 
thanks to the Gaussian assumption. The observational variables can be represented as 

p 

Xohs,k = ^ PkjXohsj + £k, £k ~ AA(0, al) {k = 1, . . . ,p), (7) 
i=i 

where = if j ^ pa(fe) = pa£){k) and ei,...,e„ are independent and independent of 
^obs,pa(fc)- Using the matrix B = {/3kj)l j^i with 

B e B{D) := {A = {akj) G RP^p ; akj = if j ^ P^Dik)}, (8) 

we can write 

Xohs = BXohs + e, e ~ A/'p(0, diag(cJi, . . . , al)). 

An interventional setting with intervention do(X/ = Uj) (and intervention target T = I) 
can be represented as follows: 

^ _ /Ej^/ Pkj Xintj + T.jei hjUj + efc , if A; ^ /, 

^int,fc — \ jj -f J ^ T ^ ' 

with /3fcj and Sk as in ([T]) with the additional property that U is independent of Xobs and e. 
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Thus, the model in (|6| is given as 

411- --^41^^ i-i-d- asinQ, 

X^^l Xi^";"*^ independent, and independent of X'-^\ . . . , X^^^^s)^ 

X-^j^l as in Q with intervention target / = r*-*-* , 

U^^\ . . . , independent, and independent of X^^i' • • • ' -^ibf 

C/«~AA(4^diag(r«^...,r»2)) (10) 

It also holds that for e in ([7|, e^^\ . . . ,e*-"'^ are independent of U^^\ . . . , U^"'\ As before, we 
denote the true underlying quantities by Bq, {(Jq j^}k, Aio,c/) Tq and the true DAG Dq. Because of 
the causal interpretation of the DAG model, we call a model as in ( 10 ) or ([6]) a Gaussian causal 
model in the following. 

2.2 Maximum likelihood estimation when the DAG is given 

The likelihood for the Gaussian model (|6]) is parameterized by the covariance matrix S of 
^obs = A/'p(0, S), the DAG D and the parameters fl^\ r*^*)^ for the stochastic intervention values 
C/*^*). Alternatively, and the route taken here, we can use the linear structural equation model 
and parameterize the likelihood with the coefficient matrix B, the error variances cr^, . . . ,(Tp, 

and yUj^^T^*)^. Using this, the matrix B is constrained such that its non-zero elements are 
corresponding to the directed edges in the DAG D. 

For a given DAG D, it is rather straightforward to derive the maximum likelihood estimator, 
as discussed below. Much more involved is the issue of structure learning when the DAG D is 
unknown: there we want to estimate a suitable Markov equivalence of the unknown DAG, as 
discussed in Section [Sl 

(i) (i)2 

It is easy to see that the log-likelihood for /x^ , rj decouples from the remaining parameters, 

and we regard Tj^^'^ (for all i and j) as nuisance parameters. 

In the sequel, we unify the notation and denote an observational data point with the inter- 
vention target 1 = 0. We can then write the distribution of X[^t \ do(Ar/ = Uj) as: 

X|do(X/ = [//) ~ AA(/x(^),s(^)), (11) 



(l-R(^)B)-'Q(n^f,u„ 



S(^) = (I - R('^B)'^ diag(a2)i?(^) + Q(^)T diag(T2)Q«] (l - 
Thereby, we have used the following matrices: 



^p^rI^I, x^xi, (12) 



The Gaussian distribution in (11) is a direct consequence of which can be rewritten in 
vector-matrix notation as 

Xint = R^^^ {BXi^t + e) + Q^^^^U . 

Denoting the intervention target for the ith data point X^*^ by T^*^ and the total sample size 
as n = nobs + ^int) the log- likelihood (conditional on U^^\ . . . , [/(")) becomes 

n 

iD{B, {alh, {r»2},; , X('\ X^")) = ^ log /^(^Wj^^t^W) (X^'^), 

i=l 
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where /^(^CO) ^crt*)) denotes the density of J\f{fi^'^^''^\ S^'^*'')) in (11) which depends on B, {(T^}a:, 

^^d {t^*-*^}^. To make notation shorter, we will denote by 7" the sequence of intervention 
targets T^^\ . . . , T^") in the following, and by X the data matrix consisting of the rows to 

For a given DAG structure D, implying certain zeroes in S G B(D) through the space B(D) 
m ^, the maximum likelihood estimator is defined as: 



BiD),{aliD)}, 



argmin -^^(S, {dj }i; T, X). 

BeBiD) 



(13) 



6.1 



the nuisance 



The expressions B{D), {o"fc(-D)^}fc have an explicit form as described in Section 
parameters {T^^^^jj do not appear in (13) any more since the minimizer of the likelihood 

does not depend on them. 



3 Estimation of the interventional Markov equivalence class 



Consider the model in (|6]) or (10). It is well known that one cannot identify the underlying 
DAG Dq from Fobs = Pq. obs- However, assuming e.g. faithfulness of the distribution as in (p|), 
one can identify the observational Markov equivalence class £{Dq) = £{Pobs) from Pobs) see for 



example Spirtes et al. (2000) or Pearl (2000) 



3.1 Characterizing the interventional Markov equivalence class 

The power of interventional data is that we can identify more than the observational Markov 



equivalence class, namely the smaller interventional Markov equivalence classes (Hauser and 



Biihlmann, 2012). Regarding the latter, we consider a family of intervention targets, a subset 
of the powerset of the vertices {!,..., p}: X C ^({1, . . . In our context X = {T^*) C 

{1, . . . z = 1, . . . , n} is the set of intervention targets of the riint interventional data together 
with the empty set as long as we have at least one observational data point (nobs > 0) . 

A family of targets X is called conservative if for all j G { 1 , . . . , p} , there is some / G X such 
that J ^ /. The simplest such family is X = {0}, i.e., observational data only. Furthermore, 
every X arising from an ensemble of observational and interventional data is a conservative 
family of targets as well. The issue that a family of targets should be conservative is crucial for 



characterization of interventional Markov equivalence classes (Hauser and Biihlmann, 2012), and 



with having jointly observational and interventional data in mind, it is not really a restriction. 

The example in Figure [l] shows three DAGs that are observationally Markov equivalent since 
they have the same skeleton (i.e., they yield the same undirected graph if all directed edges are 
replaced by undirected ones) and the same v-structures (i.e., induced subgrahps of the form 
a — >6< — c) (Verma and Pearl, 1990). If we have, in addition to observational data, data from an 



intervention at vertex 4, the orientations of the arrows incident to the intervened vertex become 
identifiable. Technically speaking, the interventional Markov equivalence class under the family 
of targets X = {0, {4}} is smaller than the observational Markov equivalence class. 



The general definition of an interventional Markov equivalence class is given in Section 6.2 
The interventional Markov equivalence class £x{Dq) is identifiable from Po,obs in ^ ^-nd the 
interventional distributions, given by fint{x\do{Xi = U) )dx in (|6|) for all / G X, assuming 



faithfulness as in assumptions (Al) and (A2) below. In Hauser and Biihlmann] ( |2012 ), the 
interventional Markov equivalence class of a DAG D for a conservative family of intervention 
targets X is rigorously characterized in terms of a chain graph with directed and undirected 
edges, the so-called interventional essential graph or X-essential graph. 
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5 > 6 7 

(a) D 



5 > 6 7 

(b) D, 



5 > 6 7 

(C) D2 



Figure 1: Three DAGs having equal skeletons and a single v-structure, 3 — > 6 < — 5, hence being 
observationally Markov equivalent. Under the family of intervention targets I = {0,{4}}, D 
and Di are still (interventionally) Markov equivalent (i.e., statistically indistinguishable), while 
D and D2 belong to different interventional Markov equivalence classes. 



3.2 Structure learning using BIC-score 



For estimating the structure and the parameters of the interventional Markov-equivalence class, 
we consider the penalized maximum-likelihood estimator using the BIC-score. Denote by B{D) 



and {d'1{D)}k the maximum-likelihood estimators for a given DAG D, as in (13). An estimate 
for the interventional Markov-equivalence class is then: 



4 = argmin-^z?(^(Z?),{<T^(L»)}fc;r,X) + - log(n) dim(fx(Z))), 
£x(D) ^ 

d[m{£x{D)) = dim(D) = number of non-zero elements in B{D). 



(14) 



The optimization is over all interventional Markov equivalence classes with corresponding DAGs 



D, see also Section [3^ below. 

We note that the ^o-psnalty has the property that the score remains invariant for all mem- 
bers in the interventional Markov equivalence class £x{D): this property is not true for some 
other penalties such as the ^i-norm. We outline in Section |3.3| a computational algorithm for 



computing the estimator in (14). 



We now justify the estimator in ( 14 ) by providing a consistency result. We make the following 
assumptions. 

(Al) The true bservational distribution Po,obs in ([2|, or equivalently the distribution of Xobs ~ 
/obs(ic)dx in ([6| is faithful with respect to the true underlying DAG Dq. 



(A2) The true interventional distributions of X^^^ ~ /int(a;| do(Xy(i) 



faithful with respect to the true underlying intervention DAG D^ 



(for the definition of the intervention DAG, see Section 2.1.1 



[/»))dx in 
for all i = 1, , 



\Q\ are 



The faithfulness assumption means that all marginal and conditional independencies can be 
read off from the DAG, here Dn or Dn 



respectively (Spirtes et al. , 2000). This is a stronger 



requirement than a Markov assumption which allows to infer some conditional independencies 
from the DAG Dq or Dq . 

In our case with a data set arising from different interventions, we do not have identically 



distributed data, as it is evident for example from equation (11). To be able to make a precise 



consistency statement for the estimator ( 14 ) , we regard the sequence of intervention targets as 
random: 

(A3) The intervention targets T^^) , . . . , T^") are n i.i.d. realizations of a random variable T 
taking values in X: P[T = /] = tf/ > for all / G X. 



2.2 



In Section 
parameters? 
(i.e., the interventions) 

nuisance parameters describing the experimental setting 



we have already seen that the parameters , t- ' [loi aii i ana j } are nuisance 
They do not belong the statistical model, but describe the experimental setting 
With assumption (A3), we introduce an additional, "artificial" set of 

By this approach, we can model the 



sequence {T^^\ X^'''))^^-^ as independent realizations of random variables {T,X) G I x MP with 
the following distribution: 

P[T = I]= Wl, f{x\T = I) = /int(x I do(X, = Ul)) . 



Theorem 1 Consider model ^ with the family of intervention targets I. Assume (Al), (A2) 
and (A3). Then: as n ^ oo, 

¥[£x = £i{Do)] ^ 1. 



A proof is given in Section 6.3 The result might not be surprising in view of model selection 
consistency results of BIC for curved exponential family models (Haughton, 1988). However, 



a careful analysis is needed to cope with the special situation of data arising from different 
interventions and hence different distributions. 

Remark 1. A version of Theorem [l] also holds without the faithfulness assumptions (Al) and 
(A2). 

We define an independence map as a DAG D* such that the observational distribution in ^ 
(or equivalently the distribution of Xobs ~ /obs(2;)dx in Q) and all interventional distributions 
of ~ fint{x\do{Xj,(i) = C/(*)))dx in (g), for all T^'\ can be generated by D* and the 
corresponding intervention DAGs D*,iy This is a generalization of an independence map for 



observational data (Pearl, 1988). A minimum independence map is an independence map having 
a minimum number of edges. A minimum independence map is typically not unique, and 
assuming faithfulness in (Al) and (A2), the set of all minimum independence maps equals the 
interventional Markov equivalence class ExADq) with Z = {T*^*^; i = 1, . . . , nint}- 



Instead of ( 14 ) consider the estimator 



D = argmin -tD{B{D), {ai{D)]k; T, X) + - log(n) dim(L»), 
D ^ 

where the optimization is over all DAGs D. The statement in Theorem [T] can then be replaced 
by: 

P[L' is a minimum independence map] — )■ 1. 



Remark 2. Although we have data sets with both observational and interventional data in 
mind, note that Theorem [6] only makes the assumption of a conservative family of intervention 
targets. In other words, consistent model selection is even possible with interventional data 
alone. 

Let / G X\ {0} be an intervention target, and denote by uj = \{i; T^*) = I, i = 1, . . . , ?7-}| the 
number of interventional data for this target. Assumption (A3) made in the theorem implies 
n/ X n — )• oo. This might not be realistic in practice since there is often only one (or very 
few) interventional data point for each target /, i.e., n/ = 1 (or n/ is small). Without having 
a rigorous proof, the consistency result of Theorem [T] is expected to hold if the intervention 
value is far away from zero, i.e., far away from the mean of Xrp^i). The heuristics can be 
exemplified as follows. 

Example 1. Consider a DAG Dq = 1 — > 2 with p = 2 and corresponding observational 
distribution from the structural equation model 

Xi~AA(0,a?), 

X2 ^ /SXi +£2, £2 ~ AA(0,(Ti). 
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Then, the interventional distribution with target 1=1 equals: 



X2|do(Xi = u)^M{pu,al] 



(15) 



whereas the marginal observational distribution is 



(16) 



Thus, if — )• 00, the means of the distributions in (15) and (16) drift away from each other 



and one realization from the intervention in (15) would be sufficient such that with probability 
tending to 1 as u — )■ 00, we could detect the difference from (one or many realizations of) the 



observational distribution in (16). 

Alternatively, if u = 0, we could detect differences of the distributions in (15) and (16) in 



terms of their variances. But we would need many realizations from (15) and (16) to detect this 
difference with high probability. 

Although obvious, we note that if the true DAG would be 1 < — 2, the distribution in (15) 



and (16) would coincide (being equal to AA(0, Var(X2)). Therefore, when doing an intervention 
do(Xi = u) and we see a difference in comparison to the marginal distribution of X2, the true 
DAG must be 1 ^ 2. 



We refer to empirical results in Section 4.2 which confirm good model selection properties if 
TT-obs is large, n/ = 1 but with intervention values U chosen sufficiently far away from zero. 



3.3 Computation 



The computation of the estimator in ( 14 ) is a highly non-trivial task. The main difficulty comes 



from the fact we have to optimize over all Markov equivalence classes. We can reformulate the 
optimization as follows: 



1 



arg min -£(5, {ai}k; T, X) + - log(n) dim(B) 



where — ^(•;T, X) is the negative log-likelihood in the model (10), and Bdag is the space of 



matrices satisfying the constraint that they correspond to a DAG. This DAG-constraint causes 
the optimization to be highly non-convex. In view of this, the £o-penalty is not adding major 
new computational challenges (and in fact allows for dynamic programming optimization, see 
below) while it enjoys nice statistical properties and leading to a score (value of the objective 
function) which is the same for all DAG members in an interventional Markov equivalence class. 



Somewhat surprisingly, although the optimization problem in (14) is NP-hard (Chickering 



1996') , dynamic programming can be used for exhaustive optimization ( Silander and Myllymak: 



2006p , roughly as long as p is less than say 20. For problems with larger dimension, the opti- 



mization in (14) can be pursued using greedy algorithms. Based on the idea from Chickering 



( |2002a|bD , one can use a greedy forward, backward and turning arrows algorithm which pur- 
sues each greedy step in the space of interventional Markov equivalence classes which is the 
much more appropriate space than the space of DAGs. An efficient implementation of such 
an algorithm, called Greedy Interventional Equivalent Search (GIES), is rigorously described 



m 



Hauser and Biihlmann (2012) where algorithmic properties, theoretical and empirical, are 



reported in detail. Although there is no guarantee that GIES converges to a global optimum, it 
seems very competitive and keeps up with dynamic programming for small-scale problems. An 
implementation of GIES is available in the R-package pcalg which is used throughout in Section 

H 
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Figure 2: ROC plots of the models estimated from the Sachs data set, for directed edges (a) 
and the skeleton (b). In (a), GLASSO is missing since it does not yield a directed model; in (b), 
random guessing is shown by a solid line. 



4 Empirical results 

We evaluated ^o-psnalized maximum likelihood estimation of interventional Markov equivalence 



classes as described in Section^ on a real data set (Section 4.1) as well as on simulated data 



(Section 4.2). 



4.1 Analysis of protein-signaling data 



We analyzed the protein-signaling data set of Sachs et al. (2005). This data set contains 7466 



measurements of the abundance of 11 phosphoproteins and phospholipids recorded under differ- 
ent experimental conditions in primary human immune system cells. The different experimental 
conditions are characterized by associated reagens that inhibit or activate signaling nodes, cor- 
responding to interventions at different points of the protein-signaling network. Interventions 
mostly take place at more than one point, and the data set is purely interventional. However, 
some of the experimental perturbations affect receptor enzymes instead of (measured) signaling 
molecules. Since our statistical framework cannot cope with interventions at latent variables, 
we only considered 5846 out of the 7466 measurements which had an identical perturbation of 
the receptor enzymes. In this way, we model the system with perturbed receptor enzymes as 
"ground state" , defining its distribution of molecule abundances as observational. 

While we can make the data set fit our interventional framework by the aforementioned 
reduction to 5846 data points, the linear-Gaussian assumption of our framework may not hold, 
even after a log-transformation of the measurements. Nevertheless, we fitted graphical models 
to the data set with different frequentist methods: GIES for the £o-penalized MLE in (14) 
(see also Sections 3.3 and 4.2.1), the PC algorithm (Spirtes et al. , 2000), the graphical LASSO 



(GLASSO, [Friedman et al. 2007), and GIES combined with stability selection (Meinshausen 



and Biihlmann, 2010). We varied the tuning parameter of each algorithm: the number of 



steps (i.e., of edge additions, deletions or reversals) in GIES, the significance level a in the PC 
algorithm, the penalty parameter A in GLASSO, and the cut-off selection probability in stability 
selection applied for GIES. We compared the estimated models to the conventionally accepted 



model which serves as ground truth (Sachs et al. , 2005); the resulting ROC plots, both with 



respect to edge directions (defining true and false positives in terms of the graphs' adjacency 



matrices) and with respect to the skeleton alone, are depicted in Figure 4.1 



The overall performance of the estimation of the skeleton is comparable for all four algorithms 
(Figure [4rT| b)), even if two of them (PC and GLASSO) treat all data as identically distributed 
and disregard its interventional nature. Regarding edge directions (Figure [4T| a)), however. 
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GIES (with or without stabiUty selection) yields an improvement over the PC algorithm. 



The Bayesian method of Cooper and Yoo (1999) used for model fitting by Sachs et al. (2005) 



is not directly comparable to the frequentist methods used here. In particular, the results from 



Sachs et al. (2005) are not easily reproducible due to choosing discretization levels and prior 
distribution. Their performance as measured by comparison to the ground truth is substantially 
better than all methods considered in this paper (15 true positives, 7 false positives in the 
convention of Figure |4.1| ^a)). Potential reasons are increased robustness due to discretization and 
specific tuning (which is legitimate in their context of extending and improving the conventional 
ground truth). 

4.2 Simulations 



We performed ^o-psnalized maximum likelihood estimation as in (14) on interventional and 
observational data simulated from 4000 randomly drawn Gaussian causal models (see ^ or 
([To]) ) to illustrate the consistency result of Theorem [TJ 



4.2.1 Experimental Settings 

We randomly drew DAGs whose skeleton has an expected vertex degree of 1.8, 1.9, 2.9 and 3.9 
for p = 10, 20, 30 and 40, respectively. For every DAG D, we randomly generated a weight 
matrix B G B(D) and error variances (7^,...,a"p such that the corresponding observational 
covariance matrix 

S = Cov(Xob.) = (I - By^ diag(a2)(I - B)-T 

had a diagonal of (1, . . . , 1), meaning that each variable of the system had an observational 
marginal variance of 1. The procedure for generating Gaussian causal models of this form has 



been described in detail by Hauser and Biihlmann (2012) 



We simulated data sets with a total sample size n = nobs + "-int between 50 and lO'OOO. 
We performed single- vertex interventions at k randomly drawn vertices {k = 0.2p, k = 0.5p 
and k = p), drawing p/k samples under each intervention (that is, 5, 2 or 1 for the chosen 
values of k) . These settings ensure that we had only nint = P interventional data points in each 
simulation, and that the majority of the data points were observational ones thus (jiobs = n—p). 
This allowed us to verify our conjecture following Theorem [T] that few interventional samples are 
sufficient for consistent estimation of interventional Markov equivalence classes (or, equivalently, 
interventional essential graphs) as long as the intervention levels, the expectation values of the 
intervention variables U, are large enough. In our simulations, we chose expecation values 
between 1 and 50 and variances of t"^ = (0.2)^ for the intervention variables. Note that because 
of the chosen normalization Tin = 1; the expectation values fijjj can be thought of as being 
indicated in units of observational standard deviations. 

To sum up, for each of the 4000 randomly generated Gaussian causal models, we simu- 
lated 144 data sets with observational and interventional data, namely one data set for each 
combination of the following experimental parameters: 

• n G {50, 100, 200, 500, 1000, 2000, 5000, 10000}; ni„t = p, nobs = n - p; 

• k G {0.2p,0.5p,p}; 

• flu, G {1,2,5,10,20,50}. 

We learned the structure of the underlying causal model from the simulated data sets using 



the BIC score as described in Section 3.3 We used the two causal inference algorithms mentioned 
in Section [331 



an adaptation of the dynamic programming approach of Silander and Myllymaki (2006) 



to interventional data which will be abbreviated as SiMy in the following. This algorithm 
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guarantees to find the global minimizer of the BIC in (14); because of its exponential 



complexity, it is only applicable for models with no more than 20 variables though. 



the Greedy Interventional Equivalence Search (GIES) of Hauser and Biihlmann (2012). 



This algorithm greedily optimizes the BIC score by traversing the search space of inter- 
ventional Markov equivalence classes through operations corresponding to edge additions, 
deletions, or reversals in the space of DAGs. The algorithm does not guarantee to find 
the optimum of the BIC, but it was empirically shown for graphs with up to p = 20 nodes 



to have a performance comparable to that of SiMy (Hauser and Biihlmann, 2012) while 
having polynomial runtime in the average case. 

We assessed the quality of the estimated causal models with the structural Hamming distance 



SHD (Tsamardinos et al. , 2006 we use the slightly adapted version of Kalisch and Biihlmann 



2007). This quantity is a metric on the space of graphs. The SHD between two graphs G 



and G is the sum of false positives and false negatives of the skeleton and wrongly oriented 
edges. Formally, if the graphs G and G have adjacency matrices A and A, respectively, the SHD 
between G and G is defined as 

SHD(G,G'):= I 

l<i<j<p 



1-1 



) 



4.2.2 Results 

Figure [3] shows the SHD between estimated and true interventional essential graph as a function 
of the total sample size n. Results for different numbers of intervention targets showed similar 
characteristics (not shown). The plots illustrate the consistency of the BIC, the main result 
of Theorem [T] As expected, convergence to the true equivalence class is faster the larger the 
intervention values (controlled by fiu) are. Note, however, that the simulation setting does not 
fully match the limit setting of the theorem: while the theoretical result asks for the sample sizes 
nj of all interventions / G X to grow in the order 0(n), we always have p interventional data 
points in our case while only the number of observational data points is growing. In the setting 



with nj X n, Hauser and Biihlmann (2012) have already empirically shown the performance of 
GIES as well as SiMy. 

Figure |4] supports our conjecture stated after Theorem [Tj even with few interventional data 
points (a total of p in our case, compared to n — p ^ p for n = 1000), the estimates of the 
causal models are substantially improved by only increasing the mean intervention values fiu- 
However, for p = 10, this effect is not clearly visible. 



5 Conclusions 

We have proposed a likelihood framework for joint modeling of Gaussian interventional and 
observational data. Such kind of data arises in many applications, notably in biology with 
measurements of wild-type individuals and modifications arising from interventional knock-outs 
of some genes. Our likelihood approach has various interesting aspects which we summarize 
as follows. The parameters in the model are the observational directed acyclic graph (DAG) 
D and the corresponding edge weights B and error variances {crf}i (or instead of B and {crf}i 
the corresponding covariance matrix of a Gaussian distribution). These parameters are global 
in the sense that every intervention distribution is determined by these parameters via the do- 
calculus: in particular, this implies that only one or a few data per intervention suffice for 
reasonably accurate estimation since the corresponding distributions are all linked to the global 
parameters. 

We show here that the BIC is consistent for estimating the corresponding interventional 
Markov equivalence class. The proof is rather involved since the various intervention distribu- 
tions are not identical and do not easily fit into a standard setting. The interventional Markov 
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Figure 3: SHD between estimated and true interventional essential graph as a function of the 
sample size n for different numbers of variables p. In each simulation, p interventional data points 
were used, 2 replicates for p/2 single- vertex intervention targets. Interventions were performed 
with an expectation value of fxu = 2 (upper row) and fj,u = 10 (lower row), respectively. 
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Figure 4: SHD between estimated and true interventional essential graph as a function of the 
intervention mean ^jj. Results for simulations with a total sample size of n = 1000, of which p 
data points originate from interventions at 20% of the vertices. 
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equivalence class is an interesting and realistic target: it is smaller than the standard observa- 
tional Markov equivalence class and it leads to a higher degree of identifiability when intervening 
at several variables. This has direct implications for tighter bounds for inferring causal effects 



(Maathuis et al., 2009). 



Besides the methodological development and theoretical derivations, we present empirical 
results for real and simulated data. 

6 Derivations and proofs 



This section contains all proofs left out in earlier sections, namely the derivation of the maximum 



likelihood estimator for a given DAG (Section |6.l| proving results of Section 



2.2l), and the proof 



of the consistency result for model selection (Section 6.3 proving Theorem IT) 




6.1 Explicit form of maximum likelihood estimator when DAG is known 

Gaussian densities form an exponential family. The joint density of Gaussian random variables 
with expectation fi and covariance S can be written as 



fMix; K, u) = (2^)-i/2 exp [{-\xx^ , K)sp + {x, u) 



log det K) 



(17) 



where the inverse covariance matrix or precision matrix K ■.= T, ^ and the transformed expec- 



tation value V := K^. form the natural parameters. In (17), (•, •)mp stands for the canonical inner 



product on W, and {',-)sp denotes the inner product {A,B)sp := iv{AB) on the vector space 
of symmetric p x p matrices. 



The canonical form (17) of the exponential family of Gaussian distributions eases calculations 



with the interventional distributions ( 11 ), especially for our goal to derive a maximum likelihood 



estimator for a causal model with interventional data originating from different interventions. 



We hence start by calculating the natural parameters for the interventional distribution (11). 



To simplify later calculations, we use the inverse error variances 7^ := o"^ to parameterize a 
Gaussian causal model from here on, together with the vector notation 7 := (71, . . . , 7p). 

Lemma 2 Let fi^^^ and S^^^ be the expectation and covariance of the interventional distribution 



(11), respectively. Then the following identities hold: 

KiD ^^{i)y^ = (I _ 5)T^(7) diag(7)i2(^)(I -B) + Q^'^^ K^'^Q^''> 



log det K^''> = Y.k0 log Ik + log det K^'^ 



We make use here of the notation K^^^ := (S(^)) ; S(^) := diag(r2) is the covariance matrix 
of the intervention variable Uj . 



Proof To prove the formulae, we use the following identities of the auxiliary matrices (12): 



(18) 



To verify the claimed formula for the precision matrix , it can be easily checked using the 
18j) that 

diag(a2)/?(^) + Q(^)Ts(^)q(^)1 = R(n diag(7)i?(') + Q^'^^ k'^''^ Q^'^ . (19) 



identities (18) that 
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We then find 



= (I - BfR^'^ diag(7)i?(')(I -B) + Q(^)T^(/)g(/) ^ 

where we again use several of the identities (18]) in the last step. 

By making use of equations (11) and (19) again, we can calculate the transformed expecta- 
tion: 



I - [r^'^ diag(7)i?(^) + Q(^)Ti^(^)Q(^)J Q^'^^^iu: 



the last step is again a consequence of the identities (18). 



For the next formula, we use the fact that i? is a nilpotent matrix; it is not hard to see 
that every matrix satisfying the DAG constraint actually is nilpotent. Therefore the inverse of 
(I - can be calculated as (l - R'^^^B)'^ = YlHl {R^^^Bf. Together with the identities 

(^18^ and the representation of ^^^^ in (11), we conclude that 



p-i 



k=0 



It follows that 



^(/)T (kH) 



where we used the formula for i/^^^ already proven before. 

To calculate the determinant of K^^^ finally, note that there is a permutation matrix P such 
that 



P 



is a block matrix. Hence 



i?(^)diag(7)ii(^) + g(^)^K(^)Q(^) 
det K^^^ = det K^^^ • JJ 

k0 



P' 



or logdet-R'(^) = Sfe^/log7fc + log det which completes the proof. 



□ 



Up to now, we have only considered a single interventional distribution. In the next lemma, 
we provide a formula for the likelihood of an interventional dataset originating from multiple 
intervention targets as defined in ([9]). In the following, we simplify notation by unifying ob- 
servational and interventional d ata p oint in a common framework. For this aim, we reuse the 
convention at the end of Section 



2.2 



and consider the entire data set (X^*))"^]^, n = nobs + nint, 
of all observational and interventional data points. To make notation short, we denote the 
complete data set by the matrix X, having the rows X^^\ . . . , X^'^\ and the list of intervention 
targets T^^\ . . . , T^"^ by T- Recall that an observational data point X^^"^ is marked by the empty 
target T^*) = 0. 

Lemma 3 Let (T, X) be an interventional dataset as defined above, produced by a Gaussian 
causal model with structure D. Moreover, let B G B(D) be a weight matrix and 7 G M^q a vector 



of inverse error variances. Denote by n 



{I) 



\{i I r« =/}| andS^') ■■=^l:^■.Ti^)=IX^^X(^^ 
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(empirical covariance matrix for intervention I £ I). Then the log-likelihood of {T,'X.) given 
parameters B and 7 is 



lex /ex 
= -l^n^'^ tr [S^'\l - diag(7)i?('Hl - S)] + ^ ^ n^^) log 7^ + C" 



/ex 



/ex ferf/ 



where C and C are constants given by the dataset (T, X) that do not depend on the model 
parameters B and 7. 

Note that in the case of purely observational data (that is, if T^*) = for all z), this result 
reproduces the classical log-likelihood (see, for example, [Banerjee et al. 2008) 

2£fl(S,7;(0)^^i,X) = n(logdeti^-tr(5i>C)) + C . 



Proof The likelihood of the entire data set is the product of the sample likelihoods (11): 

n 

£B(i5,7;r,x) = j;iog/(x« I do(x», = f/(!j,); 

i=l 
i=l 

_1 ^ tr + ^ 5^ log det K^^'-'^ + C 

i=l 1=1 

= --Y^ n^'^ tr (S^'^K^'A + - E n(^) log det if + C . 



/ex 



/ex 



In the calculations above, C stands for a constant that is independent of the model parameters 
B and 7 (note that, by Lemma [2| the remaining terms from Equation (17) are independent of 
model parameters). 

The second line of the lemma follows easily from the first one by applying the identities given 
in Lemma m □ 



The following lemma shows that the log-likelihood derived before is decomposable (Chicker- 



ing, 2002b) in the sense that it can be written as a sum of terms that only depend on a vertex 



and its parents. 

Lemma 4 Using the definitions rS~^^ := X^/gj-fc^/ 't-*-^^ and S*^^^^ := Yliex-ki^i the 
log-likelihood of Lemma can be decomposed as follows: 

p 

iD{B,r,T,X) = ^ik{Bk.,lk;T,X) + C, 

k=l 

'fc.,7A:;7',X) = -:^n(~'') \jk 



(I - B)k.S'^'''HiI- B)k.f -log^k 

^ L 

where C is a constant that does not depend on the parameters 7 and B. The calculation of the 
partial likelihoods £k only involves data measured at vertex k and its parents pa{k). 

Proof The decomposition of the second summand in Lemma [3] is easy to verify: 



^n^^) Elog7fe = X] X] 1°^^'= l°S7fc = X]"^ ''^ ^oS7k- 

/ex k<^I i=l k(^T^^) k=l i:k<^T(i) k=l 
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The decomposition of the first summand makes use of the fact that tr{AB) = tr{BA) for 
any matrices A and B for which AB and BA are defined: 



J2 n^^^ tr W^Xl - B fR^^^ diag{-f) R^^\l - B) 



lex 



n 

^tr diag(7)i2(^''')(I - - B)^' 



i=l 
n 



E lk{l-B)k.X'^'^X^')'^{{l-B)k.) 



k=l 



The fcth column of I — (I — B)k. only has entries at indices {k} U pa(/c), so the calculation 
only includes rows and columns of the empirical covariance matrix with those indices and hence 
only uses data from vertex k and its parents. □ 



Lemma |4] shows that, for a fixed DAG D, the maximum likelihood estimates for the weight 
matrix and the error variances can be calculated "locally" , that is only involving data of single 
vertices and their parents. 

Lemma 5 For a fixed DAG D and given data, the maximum likelihood estimate for its param- 
eters a and B are 



Bh 



i-k) 



s: 



i-k) 



k,pa(k) '~'k,pa,{k) \.^pa{k),pa(k) 

The maximum partial likelihoods are 



al = {I-B)k.S^'>'^ ({I-B 



sup 4(i?fc.,7fc;r,X) = (l + logal) 



^n(-'^)|l + log 



'qi-k) qi-k) 
•^kk 



S 



i-k) 



fc,pa(fc) y pa(fc),pa(fe) 
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i-k) 
pa(fc),fc 



Proof The maximum likelihood estimate must be a root of the derivative of the likelihood. 



From Lemma 



we see that q^.^ = o^.^k for i = 1, . . . ,p. This partial derivative is 

h{Bk., Ik; T, X) « (I - B)k.S^r^^ = si;^^ - BkM^''^ . 



dB, 



ki 



(20) 



For a fixed /c, B^. has one non-zero entry for every parent of k in the DAG D. For those entries, 
we get the system of linear equations 



B 



qi-k) 

fc,pa(fc)'-'pa(fc),i 



i-k) 
ki 1 



V i G pa(A;), 



by setting the partial derivatives ( 20 ) to zero. In matrix notation, this reads 



B 



fc,pa(fc)'^pa(fc),pa(fe) ~ '^fc,pa(A;) 



i-k) 



S\ 



i-k) 



and has the solution 



note that is invertible almost surely if > \ pa(A;)|. 



B 



fc,pa(fc) 



s\ 



i-k) 



S. 



i-k) 



k,psi{k) \ pa(fc),pa(fc) 



fc,pa(fc) 
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The derivative with respect to the error variances is 

^ - " 7fc) oc (I - B)k.S^-'^^ ((I - B^:^ ^ 



oik Ik 
and has the inverse root 

^=al = {l-B)k.S'^-^^ ({l-B)S 
Ik ^ ^ 

^g{-k)_g{-k) fg(-k) 

kk A;,pa(fc) I pa(fc),pa(fc)y pa(fc),fe' 

By plugging this into the formula of Lemma |4| we immediately find the formula for the supre- 
mum of the partial likelihoods. □ 



6.2 Definition of interventional Markov equivalence class 

The observational Markov equivalence class of a DAG can be described as follows. For a DAG 
D, denote by M.{D) = {/; / Markov with respect to D} all distributions which are Markov 
with respect to D. Thereby, the Markovian property is meant to be the factorization property 
as in ([3]), and we denote by / the density of the p-dimensional Gaussian distribution. Two DAGs 
D ^ D' are Markov equivalent, if and only if M{D) = M{D'). The observational equivalence 
class of a DAG D is then denoted by £{D) which can be represented as an essential graph which 



is a chain graph with directed and undirected edges (Andersson et al. , 1997). 



For the interventional Markov equivalence class, we proceed as follows. For a DAG D, 
consider the corresponding intervention DAG Dj where we remove all edges which point from 
pa(/) to /. Furthermore, consider a family of intervention targets X and corresponding tuples 
of densities (//)/gx, where each element corresponds to an intervention target I £l. Let 



MxiD) = {ifi)iex; 



V/eX: fiGMiDj), 
V/,JgX, Vi^/UJ 



and 

//(Xj|Xpa,^(j)) 



fj{Xi\Xp^^(^i))}. 



Two DAGs D and D' are interventionally Markov equivalent with respect to the family of 
targets X (notation: D ~j D') if and only if Mx{D) = Mi{D') (Hauser and Biihlmann, 2012). 
For a DAG D, the interventional Markov equivalence class with respect to X (or X-Markov 
equivalence class) is denoted by [D]x which, as in the observational case, can be characterized 
by an essential graph £x{D) (Hauser and Biihlmann, 2012). For X = 0, the definition coincides 



with the observational Markov equivalence class above. Although the definition of interventional 
Markov equivalence is somewhat cumbersome, the defined object indeed represents the DAGs 
which are equivalent and non-distinguishable from the interventional distributions (and if X 
also contains the 0-target, from observational and interventional distributions). In other words, 
assuming faithfulness as in ([2]) , the X interventional Markov equivalence is identifiable from the 
distributions. 



6.3 Proof of Theorem [T] 

In the previous section, we calculated the maximum of the likelihood of causals models given a 
set of interventional data (T, X). For model selection, that is, estimating the causal model that 
produced a given dataset, the model complexity has to be penalized to avoid overfitting. For 
large interventional (and potentially observational) samples, it stands to reason to choose the 
complexity penalty of the Bayesian information criterion (BIG). 

The maximization of the BIG of a growing sequence of i.i.d. data is known to lead to consistent 



model selection from a set of curved exponential models (Haughton 1988) 



18 



Definition 1 (Curved exponential model; Haughton, 1988) LetV = {f{x;6) = h(x) ex.p[{T{x) 
b{9)] I 6 G 0} be an exponential family with natural parameter space C M^. A curved exponen- 
tial model is a set of parameters of the form M CiQ, where M is a smooth connected manifold 
embedded in M^. 

Suppose (X»)f^ ^ is a sequence of i.i.d. realizations from a density in the exponential family 
of Definition [T| and let M D Q be an curved exponential model in that family. The Bayesian 
information criterion or BIC of M n is then defined as 



n 

S{M; X) := sup log JJ f{X^''> \9) - - dim(M) log 



n 



eeMne 



1 



n sup ((T„,6l) - 5(61)) - -dim(M)logn 
eeA/ne 2 



(21) 



where T„ stands for the mean statistic T„ := ^ X]r=i T{X^'^^) and X is the data matrix having 
the samples X^^'^ as rows. 



Theorem 6 (Consistency of the BIC; [Haughton] |1988D Let MiDQ, M2n0, ...be a finite 
set of curved exponential models in the natural parameter space of an exponential family as 

in Definition^ with the following property: for each i / j, if a point in Mi is in MjD Q, then 
it is in Mi. 

o 

Assume 9 G0 and let Mi and Mj be two different curved exponential models. If 9 & Mi\Mj, 
orifOeMiH Mj with dim(Mi) < dim(M,), then 

lim Pe[S{Mi;X) > S{Mj;X)] = 1 . 



As we explained in Section 3.2 we regard the intervention targets T^^\ . . . , T^") as a random 
sequence, taking a "value" I G I with probability wj (assumption (A3) of Section 3.2). With 
this assumption, we can treat the complete data set (T^*), X^*))"^-^ as i.i.d. realizations of random 
variables {T,X) £ I x M^. Expressed in this notation, we have shown in Section 6.1 that the 
conditional densities /(x \ T = I) = fint{x \ do(X7- = [//)) belong to an exponential family. In 
the next proposition, we show that also the joint density of (T, X) belongs to an exponential 
family. 

Proposition 7 Consider a set of random variables {X, Y) £ x {1, . . . , J} with P\Y = j] = 
Wj, 1 < j < J, and (X I y = j) ~ /(• ;9j), where f{x;9) is a density from an exponential 
family: 

f{x;9) = h{x)exp[{Tix),9)-b{9)] . 



Then the joint density of X and Y is also an element of an exponential family, namely 



f{x,y;0,r]) = h{x)exp 



S{x,y), 



a{e,r]) 



The natural parameters are given by 



[,...,9])^ andv= {m,---,VJ-iV wUh rjj 



log 



UlJ 



h{9j) + b{9j). The sufficient statistic S and the log-partion function a are given by 



S{x, y) = {dy^iTix)"^ , 6y^jT{x)^, 6y^l, . . . , dy^j-i)'^ , 

J-1 

1 + ^exp (r]j + bi9j)-bi9j)) 



a{e,r,) = b{9,) + log 
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Proof A straight-forward calculation yields to the claimed result: 

f{x,y;e,r]) = Wyf{x;9y) 

= h{x)exp[{T{x),9y) - h{6y) + \ogWy\ 
J J 



= h{x) exp 
= h[x) exp 



J 



i=i 
j-i 



Y,{5y,jT{x),ej) - J2 (log ^ - b{ej) + b{ej)\ 
j=i j=i 



J -I 



= exp 



= h{x) exp 



J J-i . X 

J](5,,,r(x), - dy, log ^ - K^i) + K^J) 
j=i j=i \ J J 



+ logu;j-6(0j)] 

5(x,y), Q\+logti;j-6(0j) 



with the definitions of S{x,y), 6 and rj from above. 

To finish the calculation, we need to express wj as a function of 6 and rj: since 



J-i 



tuj = 1 - ^ u;^ = 1 - ^ exp[r/j + 6(6'^) - 6(6'j)]wj 
i=i i=i 



we find 



wj = 



J-i 



l + ^exp(7?,+6(%)-6(^j)) 
what immediately yields the claimed log-partition function a{6,r]). 



-1 



□ 



In order to prove the consistency of the BIC for causal model selection under interventions 
in the limit of large interventional samples, we must show that the models described by different 
DAGs fit the prerequisites of Theorem [6j 



We have already seen that a single Gaussian interventional density (11) is a representative 
of an exponential family with natural parameters K^^^ and v^^^ living in and MP, respectively 
(see (17)). By Proposition [7| we conclude that the natural parameter space for the complete 



family of interventions is 



where we write J := \X\. We have already seen that the interventional densities are deter- 
mined by model parameters and experimental parameters] the model parameters are B G B(D) 
and 7 E I^>o- Therefore the sets of natural parameters corresponding to different models are 
parameterized by functions 

: B(L>) X ^ 5 X V X W . 

Before showing that the images of those maps form indeed a set of embedded manifolds in 
5 X V X W satisfying the prerequisites of Theorem [6j we sum up our notation from above and 
from Lemma [2j 
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Definition 2 Let D be a DAG. Furthermore, let I be a conservative family of intervention 
targets, and T € X arbitrary. Then we define 



<!>i) : B{D) X 



BP 

^>0 



SxVxW, 



'iex\{T} 



with 



(I) 



V 

b{K,v 



log ^ - 6 [k(^) (i?, 7) , y^'^] + b [i^ (5,7), 



1 



logdetiT) . 



Furthermore, we denote the image of m 5 x V x W by M^. 

Proposition 8 With the notation from Definition the image is an embedded, smooth 
manifold in S x V x W. 

Proof We have to prove the following points: 

(i) is smooth; 

(ii) <I>-^ is injective (and hence a bijection onto its image); 

(iii) the inverse of (on its image) is continuous; 

(iv) is an immersion, that is, its derivative is injective everywhere 



Points 



and 



and 



(iv) strengthen the 



lii) say that <^jj is a topological embedding; points (i) 
result to show that is even an embedding in the sense of differential geometry. 

We will now give the (rather technical) proofs of the aforementioned four points. Throughout 
the proofs, we will always assume w.l.o.g. that the vertices of -D = {[p],E) are numbered 
according to an inverse topological sorting, such that all matrices in B(D) are strictly lower 
triangular matrices. 

(i) The smoothness of is immediately clear from its definition: is a composition of 
smooth functions. 

(ii) Let (5,7) and {B',Y) G B(L>) x R^^ such that «>^(5,7) = ^>^(5',y); by the definition 
of <I>-^, this is the case ff and only ff K^^\B, 7) = K^^\B', 7') for all I el. This condition 
simplifies to 

(1 - 5)i?(^) diag(7)i?W(l - Bf = (1 - B')R^^'^ diag(7')i?(^Hl " B'f , 

or, with the abbreviation A := {1 - 5)^^(1 - B'), 

diag(7)i?(^U-T = diag(7')i?(^) . (22) 

By the assumption made before, B and B' are strict lower triangular matrices, hence 
yl is a lower triangular matrix with ones as diagonal entries. Then, the left-hand side 
of equation (22) is an upper triangular matrix, whereas the right-hand side is a lower 
triangular matrix. We conclude that both sides of the equation must consist of a diagonal 
matrix, and that we can transpose the left-hand side: 



A-^R^^^ diag(7)i?(^) = AR^^^ diag{-/')R^^^ 



(23) 



For some a ^ I, the a*'^ column of the matrix equation (23) reads 



(^-Miag(7)).a = iA-').ala = A. aj'a = (^ diag(7')). a 



(24) 
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Since the family of targets X is conservative^ there is, for every a G [p], some / G X 



such that a ^ I] because equation 23 holds for every / G X, the column-wise equation 
(24) holds for every a G [p], so we finally find diag(7) = Adiag(7'), or, equivalently, 
= diag(7) diag(7')^"'^. Because the diagonal of only consists of ones, we see that 
7 = 7'. It follows that = 1, and because ^ is a unit triangular matrix, this means that 
j4 = 1, and hence, by definition oi A, B = B' . Therefore, is injective. 



(iii) We can restrict our considerations to the parameterizations of the precision matrices: 



(1 — B)fc jc diag(7/c)(l — Bjc jc 



K^ll = g(^)ir(^)p(^)T = -QWi3i?(^) diag(7)(pWT - i?Wi?Tp(/)T) 
= -Bijc diag(7/c) (1 - Bjcjc)'^ . 



(25) 



(26) 



By assuming, as before, that i? is a strict lower triangular matrix, (25) represents the 
Cholesky decomposition of Kji^j^. This decomposition is unique, and Bjc jc as well as 7/0 



depend continuously on Kji^jc (Schwarz and Kockler 



2006). 



For each 6 G [p], there is some 1^1 that does not contain 6 since I is conservative. Hence 
7b can be calculated out of Kji^^ by performing the Cholesky decomposition as described 

above. This shows that 7 is a continuous function of the precision matrices {K^^^)j^x- 
Assume now that a — > 6 is an arrow in D, and let / G X be an intervention target with 
b ^ I. li a ^ I, Bab can also be calculated from Kji^c via the (continuous) Cholesky 
decomposition. Otherwise, Bab is an entry of the matrix Bjjc which can be calculated by 



solving equation (26): 



which is a continuous function since the matrix inversion is continuous. Altogether, also 
the matrix B G B(Z)) is a continuous function of the precision matrices {K^^^)j^x, what 
proves the claim. 

(iv) We have to show that the derivative d^-^{B, 7) has maximal rank for all {B, 7) G B(D) x 
M^g. For that aim, we consider the canonical basis 



{(//('^'^),0)}(,,,)eijU{(0,e,)}i<,<p 
the tangent space of B(D) x M^q at the point (5,7), where H^"-''^^ denotes 



of B(D) X 

the p X p matrix with H^^'^^ 



1 and H^f^ 
''J 



for / (a, 6), and ei denotes the i 



th 



canonical basis vector of MP. We must show that 

{d'^liB, 7)(i7('^'^), 0)}(,,fe)e£; U {d<^l{B, 7)(0, e,)}i<i<p 

is a linearly independent set for all (B, 7) G B(Z?) x M^q. Again, it is sufficient to consider 
the derivatives of the precision matrices 

We start with the directional derivative of K^^^ in direction (//('''^) , 0) for a pair (a, b) G E. 
This derivative is 

dK^^\B,j){H^^'''\0) = 

- diag(7)ii(^)(l - B f - (1 - B)R^^^ diag(7)i?(^)i/("''')^ . 

For a matrix A G M^^^, the matrix H^"''^^A contains Ab. as the a*^ row; all other rows are 
filled with zeros. We then can see that 



R^^^ diag(7)i?(-^)(l - B) 



Jb{{l-B).b)' 
0, 



if 6 ^ /, 
otherwise. 
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Based on these considerations and the fact that S is a strictly lower triangular matrix, 
one can then show that, if 6 ^ I, dK(^)(S, 7)(i?('^'^), 0) = F^"'^), where 



lb 

























-1 













Ba-l,b 






• • 


• -1 -Bb+i^b • • 


• Ba-l^b 


'^Bab 


Ba+l,b ■ ■ 


• Bpi, 









Ba+l,b 













Bpb 







J 



We continue with the calculation of the directional derivative of K^^^ in direction (0,efe), 
1 <b < p. In this less tedious case, we that 



dK(^)(5,7)(0,eb) 



{1-B).bi{t-B).b) 
0, 



if 6 ^ /, 
otherwise. 



This means that, for 6 ^ /, we have dK^^\B,j){0,eb) = G^^\ where 



/o 







V 







1 —^6+1,6 
-Bh+lfi 



-B. 



pb 



-B 



pb 



J 



It can easily be seen that the matrices {F^"''^^}a>b U {G'^^)}i<fc<p are linearly independent. 
Since for each 6 E [p], there is some I £ I with b ^ I, we can finally conclude that the set 

{d'^liB, 7)(i7('^'^), 0)}(,,,)e^ U {d'^liB, j){0, e,)}i<«<p 

is linearly independent, which proves the claim. 

□ 



We have now shown that the parameter sets are smooth embedded manifolds. To be 
able to apply Theorem [6j it remains to show that two different parameter manifolds are not 
arbitrarily close. 

Proposition 9 Let I be a conservative family of targets, and let Di and D2 be two DAGs that 
are not X- equivalent. Assume that O^SxVxWisa parameter vector with 6 G and 
9 e . Then also 6 G M^^ holds. 

Proof each j, a unique parameterization {B^^\^^^^) £ B(Z))x]R^q such that 9^^^ = ^■^_^{B^^\j^^^) 
The sequence [B^^\j^^^)^^-^ must be bounded, otherwise the sequence 9^^^ = ^■^^{B^^\-f^^'^) 

could not be bounded since K^^\ I £ I, are polynomials in B and 7 (Definition [2|. By the 
theorem of Bolzano- Weierstrass we therefore have a subsequence 7(-'fe)) that converges to 

some (B,7) G B{D) x M^^q = B{D) x M^q. 
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The parameterization has a continuous continuation on B(D) x M??,q. Therefore we have 

and $1,^(5,7) = 9 holds because of the uniqueness of hmits. 

It remains to show that {B,j) G B(D) x M^q, that is, to show that 7;, 7^ for ah b £ [p]. 
Since I is conservative, there is, for each 6 G [p], some / G X such that b ^ I. From Lemma [2| 
we know that 

det i^(^) {B, 7) = det k'^^^ Yl ; 

since the prerequisite 9 G Af^^ imphes det K^^^ 7^ 0, we conclude that 7a 7^ for all a ^ I. This 
in particular implies jb 7^ 0, which completes the proof. □ 



We have now shown that the parameter sets of all DAGs D fulfill the prerequisites of 
Theorem [6j an immediate consequence is the following corollary: 

Corollary 10 Consider model ^ with the family of intervention targets I. Assume (A3) from 
Theorem\^ and the estimator 

b = argmin -iD{B{D),{al{D)}k;T,:^) + J log(n) dim(I?) . 
Then: as n —t- 00, 

¥[D is a minimum independence map] — t- 1 , 
where P refers to the probability distribution under the true model. 



As we noted in Section 3.2, every minimum independence map is X-Markov equivalent to the 
true model if the true observational and all corresponding interventional densities are faithful. 
In this case (that is, under the assumptions (Al) and (A2) of Section 3.2), the optimization 
problem in (14) almost surely has a unique solution in the limit n — t- 00, namely the X-Markov 
equivalence class of the true model (Theorem [T]). 
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